Submodel Selection and Evaluation in Regression - the X-Random Case
Authors
Abstract
Often, in a regression situation with many variables, a sequence of submodels containing fewer variables is generated using such methods as stepwise addition or deletion of variables, or "best subsets". The question is which of this sequence of submodels is "best", and how submodel performance can be evaluated. This was explored in Breiman [1988] for a fixed X-design. This is a sequel exploring the case of random X-designs. Analytical results are difficult, if not impossible, so this study involved an extensive simulation. The basis of the study is the theoretical definition of prediction error (PE) as the expected squared error produced by applying any prediction equation to the distributional universe of (y,x) values. This definition is used throughout to compare various submodels. There are startling differences between the x-fixed and x-random situations, and different PE estimates are appropriate. Non-resampling estimates such as Cp, adjusted R2, etc. turn out to be highly biased and almost worthless methods for submodel selection. The two best methods are cross-validation and bootstrap. One surprise is that 5-fold cross-validation (leave out 20% of the data) is better at submodel selection and evaluation than leave-one-out cross-validation. There are a number of other surprises.
* Work supported by NSF Grant No. DMS-8718362.
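The kind of comparison the abstract describes can be sketched in a few lines. The following is only an illustrative toy, not the study's actual simulation design: the sample size, the number of variables, the Gaussian X with identity covariance, the absence of an intercept, and the greedy forward-stepwise routine are all assumptions made here. Under those assumptions the PE of a fitted coefficient vector b has the closed form sigma^2 + |b - beta|^2, so cross-validation estimates can be set beside the true value for each submodel size.

```python
import numpy as np

def forward_stepwise(X, y, max_size):
    """Return variable indices in the order greedy forward selection adds them."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(max_size):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = chosen + [j]
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

def fit_submodel(X, y, cols, p):
    """Least-squares fit on `cols`, returned as a length-p coefficient vector."""
    b = np.zeros(p)
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    b[cols] = coef
    return b

def true_pe(b, beta, sigma):
    """Exact PE when x ~ N(0, I) independently of the noise (assumed setup)."""
    return sigma ** 2 + np.sum((b - beta) ** 2)

def cv_pe(X, y, n_folds, rng):
    """K-fold CV estimate of PE for each submodel size 1..p.

    Variable selection is repeated inside every fold, so the held-out
    points never influence which variables enter the submodel.
    """
    n, p = X.shape
    folds = np.array_split(rng.permutation(n), n_folds)
    sq_err = np.zeros(p)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        order_k = forward_stepwise(X[train], y[train], p)
        for k in range(1, p + 1):
            b = fit_submodel(X[train], y[train], order_k[:k], p)
            sq_err[k - 1] += np.sum((y[test] - X[test] @ b) ** 2)
    return sq_err / n

# Hypothetical simulation settings, chosen only for illustration.
rng = np.random.default_rng(0)
n, p, sigma = 60, 10, 1.0
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, p))
y = X @ beta + sigma * rng.standard_normal(n)

order = forward_stepwise(X, y, p)
pe_true = np.array([true_pe(fit_submodel(X, y, order[:k], p), beta, sigma)
                    for k in range(1, p + 1)])
pe_cv5 = cv_pe(X, y, 5, rng)   # 5-fold: leave out 20% of the data
pe_loo = cv_pe(X, y, n, rng)   # leave-one-out

print("size  true PE  5-fold CV  leave-one-out")
for k in range(p):
    print(f"{k + 1:4d}  {pe_true[k]:7.3f}  {pe_cv5[k]:9.3f}  {pe_loo[k]:13.3f}")
```

Taking the argmin of each estimate gives the submodel size that estimate would select. A single simulated data set like this one says nothing about which estimate is better on average; that would require repeating the experiment over many data sets, as the study does.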
Similar resources
Regression quantiles and trimmed least squares estimator under a general design
where $X = (X_1, \ldots, X_n)'$ is the vector of independent observations, $C$ is the $n \times p$ design matrix, $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of unknown parameters and $E = (E_1, \ldots, E_n)'$, where $E_1, \ldots, E_n$ are independent and identically distributed (i.i.d.) random variables with a continuous distribution function (d.f.) $F$. Our main interest is in robustly estimating the parameter $\beta$. For the location ...
Identification of Genetic Polymorphism Interactions in Sporadic Alzheimer’s Disease Using Logic Regression
Objectives: Genetic polymorphism interactions are among the important factors in affliction with complex diseases like Alzheimer’s disease. The important goal of genetic association studies is to identify a combination of polymorphisms and measure their importance in increasing the risk of occurrence of such diseases. In this study, feature selection approach of logic regression was used to ide...
A Fuzzy Approach for Projects Evaluation and Selection an Iranian Auto Manufacturer Case Study
Evaluating and selecting alternative investment projects requires considering all relevant and important aspects. Traditional methods focus only on tangible monetary criteria. They also require either that all information about the factors be known precisely or that sufficient objective data be available for applying probability theory. In this paper, a combinative app...
A Universal Selection Method in Linear Regression Models
In this paper we consider a linear regression model with fixed design. A new rule for the selection of a relevant submodel is introduced on the basis of parameter tests. One particular feature of the rule is that subjective grading of the model complexity can be incorporated. We provide bounds for the mis-selection error. Simulations show that by using the proposed selection rule, the mis-selec...
The Usefulness of Ensemble Regressions and Methods for Selecting Optimal Predictor Variables in Predicting Stock Returns
This paper examines the usefulness of ensemble regressions and of methods for selecting optimal predictor variables (including the correlation-based method and Relief) for predicting the stock returns of companies listed on the Tehran Stock Exchange. To evaluate the performance of ensemble regression, the evaluation criteria (including mean absolute percentage error, root mean squared error, and the coefficient of determination) for this method's predictions, with linear regression and artificial neural networks...
Publication date: 2008